Endpoint and timeout fixes for sharded-CI flakes #621

Merged
rockbmb merged 8 commits into master from drop-onfinality-collectives
May 20, 2026

@rockbmb rockbmb commented May 19, 2026

A batch of CI-reliability fixes surfaced by sharded test runs (see polkadot-fellows/runtimes#1180, which consumes PET via runtimes-master).

Endpoint changes

  • Drop wss://collectives.api.onfinality.io/public-ws from collectivesPolkadot. The public-tier endpoint returns -32029: Too Many Requests under sustained load.
  • Replace the dead wss://us.bifrost-rpc.liebi.com/ws (the only configured endpoint for bifrostKusama) with the hk. host plus Liebi's no-region default.
  • Refresh KNOWN_GOOD_BLOCK_NUMBERS_*.env. The previous bump shipped a stale Bifrost Kusama fallback because the dead us. endpoint blocked yarn update-known-good.

Timeouts

defineChain.ts raises the per-chain timeout to 90s. SetupOption.timeout in chopsticks-utils governs only the test-side WsProvider; chopsticks' upstream WsProvider has a separate rpc-timeout setting that SetupOption cannot reach. This PR bundles a .yarn/patches patch that adds an rpcTimeout field to SetupOption and forwards it as rpc-timeout, mirroring AcalaNetwork/chopsticks#1034. The patch can be dropped once a chopsticks release includes that change.
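
A minimal sketch of the forwarding described above, not the actual patch. Only `timeout`, `rpcTimeout`, and the `rpc-timeout` key come from this PR; the interface shapes, the `toChopsticksConfig` helper, the placeholder endpoint, and the assumption that both values are in milliseconds are illustrative.

```typescript
// Hypothetical, simplified shapes. The real SetupOption in chopsticks-utils
// has more fields than shown here.
interface SetupOption {
  endpoint: string | string[];
  /** Timeout for the test-side WsProvider (assumed milliseconds). */
  timeout?: number;
  /** Timeout for chopsticks' upstream WsProvider (assumed milliseconds). */
  rpcTimeout?: number;
}

interface ChopsticksConfig {
  endpoint: string | string[];
  "rpc-timeout"?: number;
}

// Forward `rpcTimeout` as `rpc-timeout`. The key is omitted when the option
// is unset, so the upstream default stays in effect.
function toChopsticksConfig(option: SetupOption): ChopsticksConfig {
  const config: ChopsticksConfig = { endpoint: option.endpoint };
  if (option.rpcTimeout !== undefined) {
    config["rpc-timeout"] = option.rpcTimeout;
  }
  return config;
}

const config = toChopsticksConfig({
  endpoint: "wss://example.invalid/ws", // placeholder, not a real endpoint
  timeout: 90_000,
  rpcTimeout: 90_000,
});
console.log(config["rpc-timeout"]); // 90000
```

Note that `timeout` is deliberately not copied into the config: in this sketch it belongs to the test-side provider only, which is the distinction the paragraph above draws.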

Exclusions

  • bifrostKusama.* and karura.bifrostKusama.xcm.test.ts: every public Bifrost Kusama RPC either rejects connections or prunes state at the pinned block. Excluded until a workable endpoint set exists.
  • acala.*.test.ts: Subway hardcodes its per-upstream request_timeout to 30s and doesn't expose it in ClientConfig, so heavy Acala storage queries force Subway to cycle through the 3 Liebi endpoints without serving a response. AcalaNetwork/subway#203 adds the missing field and is merged but pending a fresh tag with a working release artifact; the exclusion can be reverted once that lands.
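
The exclusions above can be sketched as a pattern list plus a small matcher. This is an illustration of which file names the three patterns cover, assuming they are applied to bare file names; the real exclusion lives in `vitest.config.mts`, whose exact patterns are not shown in this PR text, and the sample file names below are hypothetical.

```typescript
// Patterns as written in the PR description.
const excludedSuites: string[] = [
  "bifrostKusama.*",
  "karura.bifrostKusama.xcm.test.ts",
  "acala.*.test.ts",
];

// Minimal glob support: `*` matches any run of characters except `/`.
// Every other regex metacharacter is escaped and matched literally.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, "[^/]*") + "$");
}

function isExcluded(fileName: string): boolean {
  return excludedSuites.some((pattern) => globToRegExp(pattern).test(fileName));
}

// Hypothetical file names, for illustration only.
console.log(isExcluded("acala.astar.xcm.test.ts")); // true
console.log(isExcluded("karura.bifrostKusama.xcm.test.ts")); // true
console.log(isExcluded("kusama.staking.test.ts")); // false
```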

The public-tier endpoint is rate-limited (RPC error -32029,
"Please apply an OnFinality API key") under the sustained load
produced by sharded CI runs, observed in polkadot-fellows/runtimes#1180.
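
As a rough illustration of how the rate-limit failure could be recognised programmatically, the sketch below checks a JSON-RPC error object for the `-32029` code quoted above. The helper names and the probe-result shape are hypothetical, and nothing here is taken from PET's code.

```typescript
// JSON-RPC 2.0 error object; `data` is optional in the specification.
interface JsonRpcError {
  code: number;
  message: string;
  data?: unknown;
}

// -32029 is the code reported in this PR for the OnFinality public tier.
// It sits in the -32000..-32099 range that JSON-RPC 2.0 reserves for
// implementation-defined server errors, so other providers may use other codes.
const RATE_LIMIT_CODE = -32029;

function isRateLimited(error: JsonRpcError): boolean {
  return error.code === RATE_LIMIT_CODE;
}

// Hypothetical probe result: `error` is null when the endpoint answered normally.
interface ProbeResult {
  url: string;
  error: JsonRpcError | null;
}

// Keep only the endpoints whose last probe was not rate-limited.
function dropRateLimited(probes: ProbeResult[]): string[] {
  return probes
    .filter((probe) => probe.error === null || !isRateLimited(probe.error))
    .map((probe) => probe.url);
}
```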
@github-actions

No issues found.

@rockbmb rockbmb self-assigned this May 19, 2026
@rockbmb rockbmb added the ci label May 19, 2026
Acala XCM tests (acala.astar, acala.bifrostPolkadot, etc.) hit the
60s timeout on every Acala endpoint Subway cycles through, on the
same shard that surfaced the OnFinality rate-limit failure in
polkadot-fellows/runtimes#1180. The Acala public RPC pool is slow
enough under load that the heavy XCM-Transact storage queries
don't return in 60s on any individual endpoint, so Subway burns
the full timeout per upstream before rotating, never gets a
response, and the test fails.

90s gives those queries enough headroom while still capping a
genuinely stuck call. Block numbers are bumped at the same time
to keep state lookups close to chain head.
@rockbmb rockbmb requested a review from xlc May 19, 2026 23:35
rockbmb added 2 commits May 20, 2026 00:07
`wss://us.bifrost-rpc.liebi.com/ws` was the only endpoint for
bifrostKusama and is currently network-dead (handshake timeout,
probed live). Test runs stall on bifrostKusama because Subway has
no fallback to cycle to.

The two replacements (`hk.` and the no-region default) both
respond and serve state at current tip; the no-region host is
Liebi's DNS-load-balanced entrypoint and adds geographic
redundancy in case `hk.` ever goes the way of `us.`.

`ba34d62` shipped `BIFROSTKUSAMA_BLOCK_NUMBER=13903082` as a stale
script fallback because the only configured Bifrost Kusama endpoint
was unreachable at the time, and no public RPC retained that block's
state. The previous commit fixes the endpoint; this re-runs
`yarn update-known-good` against the live endpoint to record a block
number that is actually servable, and refreshes every other chain's
block in the same pass.
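
For readers unfamiliar with the `KNOWN_GOOD_BLOCK_NUMBERS_*.env` files, here is a hedged sketch of reading one. It assumes each non-empty line has the form `<CHAIN>_BLOCK_NUMBER=<integer>`, as in the single line quoted above; the comment and blank-line handling and the `parseKnownGood` name are assumptions, not the behaviour of `yarn update-known-good`.

```typescript
// Parse the text of a known-good env file into { CHAIN: blockNumber }.
function parseKnownGood(envText: string): Record<string, number> {
  const blocks: Record<string, number> = {};
  for (const rawLine of envText.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Assumption: blank lines and `#` comments are allowed and skipped.
    if (line === "" || line.startsWith("#")) continue;
    const match = /^([A-Z0-9_]+)_BLOCK_NUMBER=(\d+)$/.exec(line);
    if (match === null) {
      throw new Error("Unrecognised line: " + line);
    }
    blocks[match[1]] = Number(match[2]);
  }
  return blocks;
}

// Sample input only: this is the stale value quoted in the commit message.
const parsed = parseKnownGood("BIFROSTKUSAMA_BLOCK_NUMBER=13903082\n");
console.log(parsed["BIFROSTKUSAMA"]); // 13903082
```

Rejecting malformed lines instead of skipping them is a design choice in this sketch: a silently ignored line is one way a stale fallback could go unnoticed.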
@rockbmb rockbmb changed the title Drop public OnFinality endpoint for collectivesPolkadot Endpoint and timeout fixes for sharded-CI flakes May 20, 2026
rockbmb added 3 commits May 20, 2026 01:08
`defineChain.ts` already set `timeout: 90_000` in the per-chain
chopsticks config, but `SetupOption.timeout` only controls the
test-side WsProvider that talks to the in-process chopsticks server;
it leaves chopsticks' own upstream WsProvider on its 60s default,
which is what produces the `No response received from RPC endpoint
in 60s` errors seen on Acala in the previous CI run on this branch.

Bundles a yarn patch that adds an `rpcTimeout` field to
`SetupOption` and forwards it as `rpc-timeout` in the chopsticks
config (mirroring AcalaNetwork/chopsticks#1034), and sets it to 90s
in `defineChain.ts`. The patch can be dropped once a chopsticks
release includes #1034.

`wss://us.bifrost-rpc.liebi.com/ws` (only configured endpoint until
the previous commit) is network-dead, and the alternative Liebi hosts
(`hk.`, no-region) only serve current-tip state; they don't retain
the historical state at the block PET pins to, so chopsticks setup
fails with `UnknownBlock: State already discarded` on every fresh
shard.

Until a public Bifrost Kusama endpoint retains state at our pinned
block (or PET runs specifically against an archive-quality endpoint
operator), the four `bifrostKusama.*` E2E suites and the cross-chain
`karura.bifrostKusama.xcm` suite are excluded from collection.
The other Kusama suites are unaffected.

The chopsticks-side patch in this PR raised `rpcTimeout` to 90s, but
Subway hardcodes its own per-upstream `request_timeout` to 30s (with
no field exposed in `ClientConfig` to override it). Heavy Acala
storage queries take longer than 30s, so Subway cycles through the 3
Liebi endpoints (~30s each) without serving a response, and chopsticks
times out before the cycle completes.

Excluding Acala suites until Subway exposes `request_timeout` as a
config field.
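
The timing argument above can be checked with a small model. This is a simplification under stated assumptions: every upstream needs the same time to answer, Subway tries upstreams strictly one after another, and the waiting client gives up once its own timeout has elapsed. The function and field names are hypothetical.

```typescript
interface CycleInput {
  upstreamCount: number; // endpoints Subway can rotate through
  perUpstreamTimeoutSec: number; // Subway's per-upstream request timeout
  queryDurationSec: number; // time an upstream needs to answer
  clientTimeoutSec: number; // timeout of the client waiting on Subway
}

type Outcome = "served" | "client-timeout" | "upstreams-exhausted";

function simulateCycle(input: CycleInput): Outcome {
  let elapsed = 0;
  for (let attempt = 0; attempt < input.upstreamCount; attempt++) {
    if (input.queryDurationSec <= input.perUpstreamTimeoutSec) {
      // This upstream answers in time, provided the client is still waiting.
      elapsed += input.queryDurationSec;
      return elapsed <= input.clientTimeoutSec ? "served" : "client-timeout";
    }
    // The attempt burns the full per-upstream timeout, then Subway rotates.
    elapsed += input.perUpstreamTimeoutSec;
    if (elapsed >= input.clientTimeoutSec) return "client-timeout";
  }
  return "upstreams-exhausted";
}

// 3 endpoints x 30s each against a 90s client timeout, with a 45s query
// (45s is an illustrative figure, not a measurement from this PR).
console.log(
  simulateCycle({ upstreamCount: 3, perUpstreamTimeoutSec: 30, queryDurationSec: 45, clientTimeoutSec: 90 }),
); // "client-timeout"
```

With `perUpstreamTimeoutSec` raised to 90 and the other inputs unchanged, the same model returns "served" on the first attempt, which is why exposing `request_timeout` matters more than adding endpoints.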
xlc added a commit to AcalaNetwork/subway that referenced this pull request May 20, 2026
* Expose per-upstream client timeouts and retries in `ClientConfig`

`Client::new` already accepts `request_timeout`, `connection_timeout`,
and `retries` arguments, but `from_config` hardcodes all three to
`None` because `ClientConfig` only exposes `endpoints` and
`shuffle_endpoints`. As a result the only way to override the 30s
per-upstream request timeout (and the 30s connection timeout, and the
default retry count) is to construct `Client` directly in Rust, which
isn't reachable from the YAML-driven config.

Adds three optional fields to `ClientConfig`:

  - `request_timeout_seconds`
  - `connection_timeout_seconds`
  - `retries`

`from_config` plumbs them into `Client::new`. None of the existing
defaults change when the fields are omitted.

The motivating case is heavy storage queries against slow public RPCs
(Acala under load is the case that surfaced this in
`polkadot-fellows/runtimes#1180` /
`open-web3-stack/polkadot-ecosystem-tests#621`) where 30s per upstream
is not enough and Subway exhausts its endpoint cycle without serving
a response.

* cargo fmt

* feat(bench): Add client config options for connection timeout, request timeout, and retries

---------

Co-authored-by: Bryan Chen <xlchen1291@gmail.com>
@rockbmb rockbmb force-pushed the drop-onfinality-collectives branch from e99a143 to dbb5d6d Compare May 20, 2026 12:30
@rockbmb rockbmb force-pushed the drop-onfinality-collectives branch from dbb5d6d to f6c715a Compare May 20, 2026 12:37
@rockbmb rockbmb merged commit 63a09c8 into master May 20, 2026
13 checks passed
@rockbmb rockbmb deleted the drop-onfinality-collectives branch May 20, 2026 13:08
rockbmb added a commit that referenced this pull request May 20, 2026
`request_timeout_seconds: 90` on Subway's upstream client (added to
`subway-template.yml` in the previous commit) gives Subway enough
time per upstream attempt for Acala storage queries to land;
previously the 30s default forced it to cycle endpoints first. The
exclusion added in PR #621 is no longer needed and is removed; the
exclusion comment is narrowed to bifrostKusama, which still lacks a
workable endpoint set.
rockbmb added a commit that referenced this pull request May 20, 2026
…pstream timeout (#622)

* Install Subway from upstream `v0.1.0` musl release in `ci.yml`

Switches `cargo install --git` to a `curl | tar -xz` of the
released static binary
(https://github.com/AcalaNetwork/subway/releases/tag/v0.1.0,
published by AcalaNetwork/subway#202). Removes the Rust
toolchain install, Subway-HEAD commit-hash lookup, and
Swatinem cache layer that existed only to amortise the
`cargo install` cost — none of them have any other
consumer in this workflow.

* Install Subway from upstream `v0.1.0` musl release in `update-known-good.yml`

Same swap as the previous commit, applied to the periodic block-number
update workflow.

* Install Subway from upstream `v0.1.0` musl release in `update-snapshot.yml`

Same swap as the previous two commits, applied to the snapshot-update
workflow.

* Fail Subway download fast on HTTP errors (`curl -f`)

Without `-f`, an HTTP 4xx/5xx response (e.g. release deleted, GitHub
degraded) leaves `curl` exiting zero with the error body on stdout,
and the downstream `tar -xz` fails with a confusing "not in gzip
format" message instead. Per review on PR #622.

* Install Subway by extracting binary from `acala/subway:v0.1.1` Docker image

The `v0.1.1` GitHub Release at AcalaNetwork/subway is missing its
`x86_64-unknown-linux-musl.tar.gz` asset; the release workflow's
`Build release binary` step failed (`cargo build --locked` mismatched
the bumped `Cargo.toml` version), so the upload was skipped. The
upstream tag still produces a working Docker image because
`docker.yml` doesn't use `--locked`, so `acala/subway:v0.1.1` is
the only working consumption path for v0.1.1.

The image's binary lives at `/usr/local/bin/subway` (per Subway's
Dockerfile); copying it out with `docker create` + `docker cp` lands
in roughly the same wall time as the curl-and-untar path and unblocks
consumption of PR #203's `request_timeout_seconds` config field.

* Set Subway per-upstream `request_timeout_seconds` to 90s

Subway's default per-upstream request timeout is 30s. With three Acala
public RPC endpoints, heavy storage queries that take longer than 30s
cause Subway to cycle through all three endpoints (~90s) before any
single upstream has a chance to respond, and the test-side waiting
client times out.

`request_timeout_seconds` was added to `ClientConfig` in
AcalaNetwork/subway#203 (Subway v0.1.1+). Setting it to 90 lets a
single upstream attempt run long enough to complete those queries
instead of being preempted by Subway's own per-endpoint clock.

The companion exclusion of Acala tests in `vitest.config.mts` is
intentionally left in place; this commit only restores Subway's
ability to wait long enough. Lifting the exclusion is a separate
verification step.

* Re-enable Acala test suites

`request_timeout_seconds: 90` on Subway's upstream client (added to
`subway-template.yml` in the previous commit) gives Subway enough
time per upstream attempt for Acala storage queries to land;
previously the 30s default forced it to cycle endpoints first. The
exclusion added in PR #621 is no longer needed and is removed; the
exclusion comment is narrowed to bifrostKusama, which still lacks a
workable endpoint set.